Clean up bot regexes #7549
Bot version parsing in bot regexes is now done consistently using `[\d.]+`. In particular, the following changes are made:
- drop wrapper parentheses, where they are used, as the bot version is not exposed by the parser, and grouping parentheses are used inconsistently across the various regexes;
- `[\d+.]` is assumed to be a typo picked up from the use of `\d+.` outside character classes;
- `[\d+\.]` additionally uses a superfluous escape on the dot, which is not needed inside character classes;
- plain `\d` or `[0-9]` are replaced with the common expression: it matches the same agent strings, but includes more of the version string in the match; mainly, this drives consistency.

The main change that should arise here is that the plus character will no longer be recognised as part of the bot/version match. The tests don't suggest that this will be an issue. (If needed, the `+` can be brought back while keeping consistency.)
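To make the difference between the variants concrete, here is a minimal sketch using Python's `re` module. The user agent string and the `ExampleBot` name are invented for illustration, a trailing `+` quantifier on the older character classes is an assumption, and the project's own regex engine may differ in details.

```python
import re

# Hypothetical user agent string, invented for this illustration.
UA = "ExampleBot/1.2+3 (+http://example.com/bot)"


def matched(pattern: str, text: str = UA) -> str:
    """Return the substring matched by `pattern`, or "" when nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else ""


# Common expression adopted by this change: digits and dots only.
consistent = matched(r"ExampleBot/[\d.]+")

# Older variants: inside a character class, "+" is a literal plus sign,
# and "\." means the same as ".". The trailing "+" quantifier is assumed.
with_plus = matched(r"ExampleBot/[\d+.]+")
with_plus_escaped = matched(r"ExampleBot/[\d+\.]+")

# Plain "\d" or "[0-9]" stops after the first digit of the version.
single_digit = matched(r"ExampleBot/\d")
single_range = matched(r"ExampleBot/[0-9]")

print(consistent)         # ExampleBot/1.2
print(with_plus)          # ExampleBot/1.2+3
print(with_plus_escaped)  # ExampleBot/1.2+3
print(single_digit)       # ExampleBot/1
print(single_range)       # ExampleBot/1
```

All five patterns identify the same agent; only the extent of the match differs, which is why the change is mostly about consistency.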
Given that the groupings don't have a meaning, there's little point in keeping them as part of the regexes. In addition, superfluous parentheses are dropped.
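As a rough illustration of why the parentheses can go, the sketch below compares a grouped and an ungrouped pattern in Python's `re` module. The `ExampleBot` patterns are hypothetical and are not taken from the project's regex files.

```python
import re

UA = "ExampleBot/2.0"  # hypothetical user agent string

# Same expression with and without capturing groups.
grouped = re.search(r"(ExampleBot)/([\d.]+)", UA)
ungrouped = re.search(r"ExampleBot/[\d.]+", UA)

# The overall match is identical; the groups only add captures that a
# caller would have to read explicitly.
same_match = grouped.group(0) == ungrouped.group(0)

print(same_match)          # True
print(grouped.groups())    # ('ExampleBot', '2.0')
print(ungrouped.groups())  # ()
```

If nothing reads the captured groups, the two patterns are interchangeable for detection purposes.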
The wildcard matches extend the match, but don't add value as the match is already made at that point.
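The effect of a trailing wildcard can be sketched as follows, again in Python's `re` module with invented user agent strings and a hypothetical `ExampleBot` pattern. The wildcard changes how much text is matched, not whether the agent is detected.

```python
import re

# Hypothetical user agent strings: one bot, one regular browser.
SAMPLES = [
    "Mozilla/5.0 (compatible; ExampleBot/1.0; +http://example.com/bot)",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
]


def is_bot(pattern: str, ua: str) -> bool:
    """True when `pattern` is found anywhere in the user agent string."""
    return re.search(pattern, ua) is not None


with_wildcard = [is_bot(r"ExampleBot.*", ua) for ua in SAMPLES]
without_wildcard = [is_bot(r"ExampleBot", ua) for ua in SAMPLES]

print(with_wildcard)     # [True, False]
print(without_wildcard)  # [True, False]

# Only the extent of the match differs.
print(re.search(r"ExampleBot.*", SAMPLES[0]).group(0))
# ExampleBot/1.0; +http://example.com/bot)
print(re.search(r"ExampleBot", SAMPLES[0]).group(0))
# ExampleBot
```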
Placing the dash at the end of character classes avoids the need to escape it.
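A small sketch of the dash behaviour, using Python's `re` module and made-up character classes (none of them are quoted from the project's regexes): an escaped dash and a trailing dash are equivalent, while an unescaped dash between two characters forms a range.

```python
import re

# Escaped dash in the middle versus unescaped dash at the end:
# both classes accept lowercase letters, "-" and "_".
escaped = re.compile(r"[a-z\-_]+")
trailing = re.compile(r"[a-z_-]+")

name = "my-example_bot"  # hypothetical token
escaped_ok = escaped.fullmatch(name) is not None
trailing_ok = trailing.fullmatch(name) is not None

# An unescaped dash between two characters is a range instead:
# "[+-.]" covers "+", ",", "-" and ".", so it also accepts a comma.
range_class = re.compile(r"[+-.]")
literal_class = re.compile(r"[+.-]")

range_accepts_comma = range_class.fullmatch(",") is not None
literal_accepts_comma = literal_class.fullmatch(",") is not None

print(escaped_ok, trailing_ok)  # True True
print(range_accepts_comma)      # True
print(literal_accepts_comma)    # False
```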
6f6ac3e to 9dc9bda
This is a really nice PR!
@biochimia please resolve the conflicts in this PR.
@sanchezzzhak can I resolve the conflicts, so we can get this PR merged?
Yes (you may need Collaborator rights).
@sgiehl Can you grant the same Collaborator rights to liviuconcioiu?
Sorry for the radio silence; I haven't been able to devote any time to this. I don't object to @liviuconcioiu or anyone else picking up the changes and addressing the conflicts as needed, if you have the opportunity. I assume that the changes will need to be (at least partially) redone, in order to address the same issues in any new or updated regexes. If you don't beat me to it, I will try to find some time later this week to apply the changes on a fresh pull.
I'll send an invite.
Thanks both for the help!
Description:
The changes in this pull request clean up regexes used to identify bots. Each change is performed in its own commit with a description of the change. Overall, the changes make the expressions stricter and more consistent, thus reducing the cognitive load for anyone perusing them.
In summary, these are the changes introduced:
- `\.` where an actual dot is expected, such as in URLs;
- `[\d.]+` for bot versions, whereas previously there were a few different ways bot versions were parsed, essentially to the same effect;
- `-` at end of character classes.
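The stricter dot handling in URLs can be illustrated with a short sketch in Python's `re` module. The `example.com` URL and both patterns are placeholders chosen for the example, not entries from the project's regex files.

```python
import re

# Unescaped dot: matches any character in that position.
loose = re.compile(r"example.com/bot")
# Escaped dot: matches only a literal ".".
strict = re.compile(r"example\.com/bot")

genuine = "+http://example.com/bot"    # hypothetical bot URL
lookalike = "+http://exampleXcom/bot"  # not the same host

loose_results = [loose.search(s) is not None for s in (genuine, lookalike)]
strict_results = [strict.search(s) is not None for s in (genuine, lookalike)]

print(loose_results)   # [True, True]
print(strict_results)  # [True, False]
```

The escaped form still accepts the genuine URL and rejects the lookalike, which is the sense in which the expressions become stricter.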